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ABSTRACT 

Motivation: Alternative splicing (AS) is a regulated process that directs 
the generation of different transcripts from single genes. A computa- 
tional model that can accurately predict splicing patterns based on 
genomic features and cellular context is highly desirable, both in 
understanding this widespread phenomenon, and in exploring the 
effects of genetic variations on AS. 

Methods: Using a deep neural network, we developed a model 
inferred from mouse RNA-Seq data that can predict splicing patterns 
in individual tissues and differences in splicing patterns across tissues. 
Our architecture uses hidden variables that jointly represent features in 
genomic sequences and tissue types when making predictions. 
A graphics processing unit was used to greatly reduce the training 
time of our models with millions of parameters. 
Results: We show that the deep architecture surpasses the perform- 
ance of the previous Bayesian method for predicting AS patterns. With 
the proper optimization procedure and selection of hyperparameters, 
we demonstrate that deep architectures can be beneficial, even with a 
moderately sparse dataset. An analysis of what the model has learned 
in terms of the genomic features is presented. 
Contact: frey@psi.toronto.edu 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 



1 INTRODUCTION 

Alternative splicing (AS) is a process whereby the exons of a 
primary transcript may be connected in different ways during 
pre-mRNA splicing. This enables the same gene to give rise to 
splicing isoforms containing different combinations of exons, 
and as a result different protein products, contributing to the 
cellular diversity of an organism (Wang and Burge, 2008). 
Furthermore, AS is regulated during development and is often 
tissue dependent, so a single gene can have multiple tissue-spe- 
cific functions. The importance of AS lies in the evidence that at 
least 95% of human multi-exon genes are alternatively spliced 
and that the frequency of AS increases with species complexity 
(Barbosa-Morais et aL, 2012; Pan et aL, 2008). 

One mechanism of splicing regulation occurs at the level of the 
sequences of the transcript. The presence or absence of certain 
regulatory elements can influence which exons are kept, while 
others are removed, before a primary transcript is translated 
into proteins. Computational models that take into account 
the combinatorial effects of these regulatory elements have 
been successful in predicting the outcome of splicing (Barash 
et aL, 2010). 
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Previously, a 'splicing code' that uses a Bayesian neural net- 
work (BNN) was developed to infer a model that can predict the 
outcome of AS from sequence information in different cellular 
contexts (Xiong et aL, 2011). One advantage of Bayesian meth- 
ods is that they protect against overfitting by integrating over 
models. When the training data are sparse, as is the case for 
many datasets in the life sciences, the Bayesian approach can 
be beneficial. It was shown that the BNN outperforms several 
common machine learning algorithms, such as multinomial lo- 
gistic regression (MLR) and support vector machines, for AS 
prediction in mouse trained using microarray data. 

There are several practical considerations when using BNNs. 
They often rely on methods like Markov Chain Monte Carlo 
(MCMC) to sample models from a posterior distribution, 
which can be difficult to speed up and scale up to a large 
number of hidden variables and a large volume of training 
data. Furthermore, computation-wise, it is relatively expensive 
to get predictions from a BNN, which requires computing the 
average predictions of many models. 

Recently, deep learning methods have surpassed the state-of- 
the-art performance for many tasks (Bengio et aL, 2013). Deep 
learning generally refers to methods that map data through mul- 
tiple levels of abstraction, where higher levels represent more 
abstract entities. The goal is for an algorithm to automatically 
learn complex functions that map inputs to outputs, without 
using hand-crafted features or rules (Bengio, 2009). One imple- 
mentation of deep learning comes in the form of feedforward 
neural networks, where levels of abstraction are modeled by mul- 
tiple non-linear hidden layers. 

With the increasingly rapid growth in the volume of 'omic' 
data (e.g. genomics, transcriptomics, proteomics), deep learning 
has the potential to produce meaningful and hierarchical repre- 
sentations that can efficiently be used to describe complex bio- 
logical phenomena. For example, deep networks may be useful 
for modeling multiple stages of a regulatory network at the 
sequence level and at higher levels of abstraction. 

Ensemble methods are a class of algorithms that are popular 
owing to their generally good performance (Caruana and 
Niculescu-Mizil, 2006), and are often used in the life sciences 
(Touw et aL, 2013). The strength of ensemble methods comes 
from combining the predictions of many models. Random for- 
ests is an example, as is the Bayesian model averaging method 
previously used to model the regulation of splicing. Recently, 
neural network learning has been improved using a technique 
called dropout, which makes neural networks behave like an 
ensemble method (Hinton and Srivastava, 2012). Dropout 
works by randomly removing hidden neurons during the presen- 
tation of each training example. The outcome is that instead of 
training a single model with TV hidden variables, it approximates 
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the training of 2 different networks, each on a different subset 
of the training data. It is described as an 'extreme form of 
bagging' and is a computationally efficient way of doing model 
averaging (Hinton and Srivastava, 2012). 

With large datasets, learning with MCMC methods can be 
slow and can be outperformed by stochastic optimization meth- 
ods in practice (Ahn et al, 2012). These algorithms process small 
subsets (minibatches) of data at each iteration, and update model 
parameters by taking small steps in the direction of the gradient 
to optimize the cost function. It is common to use stochastic 
gradient descent to train feedforward neural networks. The 
learning algorithm (backpropagation) is also conceptually 
simple, involving for the most part matrix multiplications, 
which makes them suitable for speedup using graphics processing 
units (GPU). 

Here, we show that the use of large (many hidden variables) 
and deep (multiple hidden layers) neural networks can improve 
the predictive performances of the splicing code compared with 
previous work. We also provide an evaluation method for re- 
searchers to improve and extend computational models for pre- 
dicting AS. Another goal is to describe the procedure for training 
and optimizing a deep neural network (DNN) on a sparse and 
unbalanced biological dataset. Furthermore, we show how such 
a DNN can be analyzed in terms of its inputs. To date, aside 
from a small number of works (Di Lena et al, 2012; Eickholt 
and Cheng, 2012), deep learning methods have not been applied 
in the life sciences, even though they show tremendous promise. 

We show results supporting that DNN with dropout can be a 
competitive algorithm for doing learning and prediction on bio- 
logical datasets, with the advantage that they can be trained 
quickly, have enough capacity to model complex relationships 
and scale well with the number of hidden variables and volume 
of data, making them potentially highly suitable for 'omic' 
datasets. 

Different from the previous BNN, which used 30 hidden 
units, our architecture has thousands of hidden units with multiple 
non-linear layers and millions of model parameters 
(Supplementary Table S2). We also explored a different connec- 
tion architecture compared with previous work. Before, each 
tissue type was considered as a different output of the neural net- 
work. Here, tissues are treated as an input, requiring that the 
complexity of the splicing machinery in response to the cellular 
environment be represented by a set of hidden variables that 
jointly represent both the genomic features and tissue context. 

Besides a different model architecture, we also extended the 
code's prediction capability. In previous work, the splicing code 
infers the direction of change of the percentage of transcripts 
with an exon spliced in (PSI) (Katz et al, 2010), relative to all 
other tissues. Here, we perform absolute PSI prediction for each 
tissue individually without the need for a baseline averaged 
across tissues. We also predict the difference in PSI (APSI) be- 
tween pairs of tissues to evaluate the model's tissue-specificity. 
We show how these two prediction tasks can be trained simul- 
taneously, where the learned hidden variables are useful for both 
tasks. 

We compare the splicing code's performance trained with the 
DNN with the previous BNN and additionally optimized a 
MLR classifier on the same task for a baseline comparison. 
A GPU was used to accelerate training of the DNN, which 



made it feasible to perform hyperparameter search to optimize 
prediction performance with cross validation. 



2 METHODS 

2.1 Dataset 

The dataset consists of 11019 mouse alternative exons profiled from 
RNA-Seq data prepared by (Brawand et al, 2011). The exons are 
based on a set of cassette exons derived from EST/cDNA sequences 
(Barash et al., 2010, 2013). Five tissue types are available, including 
whole brain, heart, kidney, liver and testis. To estimate the splicing 
level for each exon and tissue, we mapped the 75 nucleotide reads to 
splice junctions, requiring a minimum overhang of 8 nucleotides on 
each side of the junction, giving 60 mappable positions. A bootstrap 
method that takes into account position-dependent read biases in 
RNA-Seq was then used to obtain an estimate of PSI that reflects its 
uncertainty (Kakaradov et al., 2012). This allows the generation of a 
distribution of PSI for each exon and tissue. To obtain the distribution 
denoting the difference in PSI, or APSI, the difference between bootstrap 
samples was calculated for all pairs of tissues to generate a distribution in 
the same manner. The possible values range from 0 to 1 for the PSI 
distribution, and -1 to 1 in the APSI distribution. Additional informa- 
tion about the dataset can be found in Section 1 of the Supplementary 
Material. To test whether the model generalizes to a different dataset, 
RNA-Seq data from (Barbosa-Morais et al., 2012) was processed in the 
same manner for brain and heart, which was used only for testing. 

For each exon, a set of intronic, exonic and structural features was 
derived from sequences in the alternative exon (A), flanking constitutive 
exons (CI and C2) and introns between CI and A (II) and A and C2 (12), 
forming a feature vector of length 1393. These features include those 
originally described in (Barash et al., 2010) and the extended feature set 
from (Barash et al, 2013). Features related to the premature termination 
codon have been removed because they rely on knowing the splicing 
pattern a priori and cannot be computed by just the local genomic se- 
quences. Instead, four binary 'translatability' features are introduced, 
which describe whether exons can be translated without a stop codon 
in one of three possible reading frames. The features are summarized in 
Section 4 of the Supplementary Material. 

2.2 The model 

We formulate splicing prediction as a classification problem with multiple 
classes. Figure 1 shows the architecture of the DNN. The parameters of 
the model are summarized in Section 2 of the Supplementary Material. 
The nodes of the neural network are fully connected, where each connec- 
tion is parameterized by a real-valued weight 6. The DNN has multiple 
layers of non-linearity consisting of hidden units. The output activation a 
of each hidden unit v in layer / processes a sum of weighted outputs from 
the previous layer, using a non-linear function f: 

where M l represents the number of hidden units in layer /, and a 0 and M° 
are the input into the model and its dimensionality, respectively. We used 
two different activation functions for the hidden units, namely the hyper- 
bolic tangent (TANH) function, and the rectified linear unit (RELU), 
which is defined as (Glorot et al., 2011): 

/R£LC/O0 = max(0, z) (2) 

The RELU unit was used for all hidden units except for the first hidden 
layer, which used TANH units, based on empirical performance on val- 
idation data. 
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Fig. 1. Architecture of the DNN used to predict AS patterns. It contains three hidden layers, with hidden variables that jointly represent genomic 
features and cellular context (tissue types) 



Inputs into the first hidden layer consist of F = 1393 genomic features 
Xf= i. . .f describing an exon, neighboring introns and adjacent exons. To 
improve learning, the features were normalized by the maximum of the 
absolute value across all exons. The purpose of this hidden layer is to 
reduce the dimensionality of the input and learn a better representation of 
the feature space. 

The identity of two tissues, which consists of two 1-of-T binary vari- 
ables ti=i...T and tj= L _ T , are then appended to the vector of outputs of 
the first hidden layer, together forming the input into the second hidden 
layer. For this work, T = 5 for the five tissues available in the RNA-Seq 
data. We added a third hidden layer as we found it improved the model's 
performance. The weighted outputs from the last hidden layer is used as 
input into a softmax function for classification in the prediction h k (x,t,6), 
which represents the probability of each splicing pattern k: 

" E*exp(£„X>?) 

To learn a set of model parameters 6, we used the cross-entropy cost 
function E on predictions h(x,t,6) given targets y(x,t), which is mini- 
mized during training: 

where n denotes the training examples, and k indexes C classes. 

We are interested in two types of predictions. The first task is to predict 
the PSI value given a particular tissue type and a set of genomic features. 
To generate the targets for training, we created C = 3 classes, which we 
label as low, medium and high categories. Each class contains a real- value 
variable obtained by summing the probability mass of the PSI distribu- 
tion over equally split intervals of 0-0.33, 0.33-0.66 and 0.66-1. They 
represent the probability that a given exon and tissue type has a PSI value 
ranging from these corresponding intervals, hence are soft class labels. 
We will refer this as the 'low, medium, high' (LMH) code, with targets 

The second task describes the APSI between two tissues for a particu- 
lar exon. We again generate three classes, and call them decreased inclu- 
sion, no change and increased inclusion, which are similarly generated, but 
from the APSI distributions. We chose an interval that more finely dif- 
ferentiates tissue-specific AS for this task, where a difference of >0.15 
would be labeled as a change in PSI levels. We summed the probability 
mass over the intervals of —1 to —0.15 for decreased inclusion, —0.15 to 
0.15 for no change and 0.15 to 1 for increased inclusion. The purpose of 



this target is to learn a model that is independent of the chosen PSI class 
intervals in the LMH code. For example, the expected PSI of two tissues 
ti and tj for an exon could be 0.40 and 0.60. The LMH code would be 
trained to predict medium for both tissues, whereas this tissue difference 
code would predict that tj has increased inclusion relative to tj. We will 
refer to this as the 'decrease, no change, increase' (DNI) code, with targets 
yf> m (x, t,, tj ). 

Both the LMH and DNI codes are trained jointly, reusing the same 
hidden representations learned by the model. For the LMH code, two 
softmax classification outputs predict the PSI for each of the two tissues 
that are given as input into the DNN. A third softmax classification 
function predicts the APSI for the two tissues. We note that two PSI 
predictions are included in the model's output so we have a complete 
set of predictions that use the full input features. The total cost of the 
model used during optimization is the sum of the cross-entropy functions 
(4) for both prediction tasks. 

The BNN architecture used for comparison is the same as previously 
described (Xiong et al., 2011), but trained on RNA-Seq data with the 
expanded feature set and LMH as targets. Although hidden variables 
were shared across tissues in both the BNN and DNN, a different set 
of weights was used following the single hidden layer to predict the 
splicing pattern for each tissue separately in the BNN (Supplementary 
Fig. S3). In the current DNN, the tissue identities are inputs and are 
jointly represented by hidden variables together with genomic features. 
For the BNN to make tissue difference predictions in the same manner as 
the DNI code, we fitted a MLR on the predicted LMH outputs for each 
tissue pair (Supplementary Fig. S4). 

2.3 Training the model 

The first hidden layer was trained as an autoencoder to reduce the dimen- 
sionality of the feature space in an unsupervised manner. An autoencoder 
is trained by supplying the input through a non-linear hidden layer, and 
reconstructing the input, with tied weights going into and out of the 
hidden layer. This method of pretraining the network has been used in 
deep architectures to initialize learning near a good local minimum 
(Erhan et al, 2010; Hinton and Salakhutdinov, 2006). We used an auto- 
encoder instead of other dimensionality reduction techniques like prin- 
ciple component analysis because it naturally fits into the DNN 
architecture, and that a non-linear technique may discover a better and 
more compact representation of the features. 

In the second stage of training, the weights from the input layer to the 
first hidden layer (learned from the autoencoder) are fixed, and 10 add- 
itional inputs corresponding to tissues are appended. A one-hot encoding 
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representation is used, such that specifying a tissue for a particular train- 
ing example can take the form [0 1 0 0 0] to denote the second tissue out 
of five possible types. We have two such inputs totaling 10 variables that 
specify tissue types. The reduced feature set and tissue variables become 
input into the second hidden layer. The weights connected to this and the 
final hidden layer of the DNN are then trained together in a supervised 
manner, with targets being PSI and APSI. After training these final two 
layers, weights from all layers of the DNN were fine-tuned by 
b ackpropagation . 

Each training example consists of 1393 genomic features and two 
tissue types as input. The targets consist of (i) PSI for each of the two 
tissues and (ii) APSI between the two tissues. Given a particular exon and 
five possible tissue types, we constructed 5 x 5 = 25 training examples. 
This construction has redundancy in that we generate examples where 
both tissues are the same in the input to teach the model that it should 
predict no change for APSI given identical tissue indices. Also, if the 
tissues are swapped in the input, a previously increased inclusion label 
should become decreased inclusion. The same rationale extends to the 
LMH code. Generating these additional examples is one method to in- 
corporate this knowledge without explicitly specifying it in the model 
architecture. 

We applied a threshold to exclude examples from training if the total 
number RNA-Seq junction reads is below 10. This removed 45.8% of the 
total number of training examples. We further define exons as having 
large tissue variability if APSI > ±0.15 for at least one tissue pair profiled. 
These exons make up 28.6% of the total number of remaining exons that 
have more than 10 junction reads. Additional information about the read 
coverage of the dataset can be found in Section 1 of the Supplementary 
Material. 

To promote the neural network to better discover the meaning of the 
inputs representing tissue types, we biased the distribution of training 
examples in the minibatches. We first selected all events which exhibit 
large tissue variability, and constructed minibatches based only on these 
events. At each training epoch, we further sampled (without replacement) 
training cases from the larger pool of events with low tissue variability, of 
size equal to one fifth of the minibatch size. The purpose is to have a 
consistent backpropagation signal that updates the weights connected to 
the tissue inputs and bias learning towards the event with large tissue 
variability early on before overfitting occurs. As training progresses, the 
splicing pattern of the events with low tissue variability is also learned. 
This arrangement effectively gives the events with large tissue variability 
greater importance (i.e. more weight) during optimization. A side effect is 
that it also places more importance to the medium category of the LMH 
code during training, since they tend to be present more often in exons 
with tissue-specific splicing patterns. 

Both the LMH and DNI codes are trained together. Because each of 
these two tasks might be learning at different rates, we allowed their 
learning rates to differ. This is to prevent one task from overfitting too 
soon and negatively affecting the performance of another task before the 
complete model is fully trained (Silver and Mercer, 1996). This is imple- 
mented by having different learning rates for the weights between the con- 
nections of the last hidden layer and the softmax functions for each task. 

The performance of the model was assessed using the area under the 
Receiver- Operating Characteristic curve (AUC) metric. To evaluate the 
PSI predictions for the LMH code, we used the 1 versus all formulation. 
This produces three AUCs (AUC Low , AUC Me d, AUC H i g h), one for each 
class. For APSI predictions, since the no change class is much more abun- 
dant, we find that the multi-class 1 versus all formulation tends to over- 
estimate the tissue-specificity performance of the model due to class skew 
(Fawcett, 2006). Furthermore, the model can predict, based on the gen- 
omic features alone, that there is tissue-specific splicing for a given exon 
(which is biologically meaningful), but not necessarily how different tissues 
change the splicing pattern. We therefore provide two metrics to evaluate 
the DNI code. The first is to compute the AUC DvI based on the decrease 



versus increase class between two tissues. The second is to compute 
AUCchange by comparing no change versus the other two classes. 

To train and test the DNN, data was split into approximately five 
equal folds at random for cross validation. Each fold contains a unique 
set of exons that are not found in any of the other folds. Three of the 
folds were used for training, one used for validation and one held out for 
testing. We trained for a fixed number of epochs and selected the hyper- 
parameters that give the optimal AUC performance on the validation 
data. The model was then retrained using these selected hyperparameters 
with both the training and validation data. Five models were trained this 
way from the different folds of data. Predictions from all five models on 
their corresponding test set were used to evaluate the code's performance. 
To estimate the confidence intervals, the data were randomly partitioned 
five times, and the above training procedure was repeated. 

The DNN weights were initialized with small random values sampled 
from a zero-mean Gaussian distribution. Learning was performed with 
stochastic gradient descent with momentum and dropout, where mini- 
batches were constructed as described above. A small LI weight penalty 
was included in the cost function (4) (Tibshirani, 1994). The model's 
weights were updated after each minibatch. The learning rate s was 
decreased with epochs e, and also included a momentum term /x that 
starts out at 0.5, increasing to 0.99, and then stays fixed. The momentum 
term accelerates learning, and stabilizes learning near the end of training 
when the momentum is high by distributing gradient information over 
many updates. The weights of the model parameters 6 were updated as 
follows: 

e e =e e -i+Ae e 

(5) 

Ae e = fz e A0 e -i - (1 - fz e )8 e WE(6 e ) 

We used a dropout rate of 50% for all layers except for the input layer 
(the autoencoder), where we did not use dropout, as it empirically 
decreased the model's predictive performance. Training was carried out 
for 1500 epochs for both the pretraining with the autoencoder and super- 
vised learning. 

The performance of a DNN depends on a good set of hyperpara- 
meters. Instead of doing a grid search over the hyperparameter space, 
we used a Bayesian framework called spearmint to automatically select 
the model's hyperparameters (Snoek et al, 2012). The method uses a 
Gaussian Process to search for a joint setting of hyperparameters that 
optimizes an algorithm's performance on validation data. It uses the 
performance measures from previous experiments to decide which hyper- 
parameters to try next, taking into account the trade-off between explor- 
ation and exploitation. This method eliminates many of the human 
judgments involved with hyperparameter optimization and reduces the 
time required to find such hyperparameters. The algorithm requires only 
the search range of hyperparameter values to be specified, as well as how 
long to run the optimization for. We used the expected improvement 
criterion in the optimization, as it does not require its own tuning par- 
ameter, unlike other methods in the framework. We score each experi- 
ment by the sum of the AUCs from both the LMH and DNI codes, 
requiring the set of hyperparameters to perform well on both tasks. 
Detailed information on the selected hyperparameters and search proced- 
ure are described in Section 2 of the Supplementary Material. 

The DNN was implemented in Python, making use of Gnumpy for 
GPU- accelerated computation (Tieleman, 2010). The GPU used was a 
Nvidia GTX Titan. For the configuration with the optimal hyperpara- 
meters, the GPU provided ~ 15-fold speedup over our original CPU im- 
plementation. This was crucial as otherwise hyperparameter optimization 
would not have been practical. 

We compared the splicing code's performance trained with the DNN 
with the BNN, as well as an MLR classifier as a baseline. The MLR was 
trained by removing the hidden layer while keeping the training method- 
ology identical to the neural networks. Because the predictions of the 
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BNN consist only of the PSI prediction for each tissue separately at the 
output (Xiong et al., 2011), for the BNN to make tissue difference pre- 
dictions in the same manner as the DNI code, we used a MLR on the 
predicted outputs for each tissue pair. For a fair comparison, we similarly 
trained a MLR on the LMH outputs of the DNN to make DNI predic- 
tions, and report that result separately. In both cases, the inputs to the 
MLR are the LMH predictions for two tissues as well as their logarithm. 
Schematic of the BNN and MLR architecture can be found in 
Supplementary Figures S3 and S4. 

3 RESULTS 

We present three sets of results that compare the test perform- 
ance of the BNN, DNN and MLR for splicing pattern predic- 
tion. The first is the PSI prediction from the LMH code tested on 
all exons. The second is the PSI prediction evaluated only on 
targets where there are large variations across tissues for a given 
exon. These are events where APSI > ±0.15 for at least one pair 
of tissues, to evaluate the tissue specificity of the model. The 
third result shows how well the code can classify APSI between 
the five tissue types. Hyperparameter tuning was used in all 
methods. The averaged predictions from all partitions and 
folds are used to evaluate the model's performance on their cor- 
responding test dataset. Similar to training, we tested on exons 
and tissues that have at least 10 junction reads. 

For the LMH code, as the same prediction target can be gen- 
erated by different input configurations, and there are two LMH 
outputs, we compute the predictions for all input combinations 
containing the particular tissue and average them into a single 
prediction for testing. To assess the stability of the LMH predic- 
tions, we calculated the percentage of instances in which there is 
a prediction from one tissue input configuration that does not 
agree with another tissue input configuration in terms of class 
membership, for all exons and tissues. Of all predictions, 91.0% 
agreed with each other, 4.2% have predictions that are in adja- 
cent classes (i.e. low and medium, or medium and high), and 4.8% 
otherwise. Of those predictions that agreed with each other, 
85.9% correspond to the correct class label on test data, 
51.2% for the predictions with adjacent classes and 53.8% for 
the remaining predictions. This information can be used to assess 
the confidence of the predicted class labels. Note that predictions 
spanning adjacent classes may be indicative that the PSI value is 
somewhere between the two classes, and the above analysis using 
hard class labels can underestimate the confidence of the model. 

3.1 Performance comparison 

Table la reports AUC LM h_ah for PSI predictions from the LMH 
code on all tissues and exons. The performance of the DNN in 
the low and high categories are comparable with the BNN, but 
excels at the medium category, with especially large gains in 
brain, heart and kidney. Because a large portion of the exons 
exhibit low tissue variability (Section 1 of Supplementary 
Material), evaluating the performance of the model on all 
exons may mask the performance gain of the DNN. This as- 
sumes that exons with high tissue variability are more difficult 
to predict, where a computational model must learn how AS 
interprets genomic features differently in different cellular envir- 
onments. To more carefully see the tissue specificity of the dif- 
ferent methods, Table lb reports AUC LM h_tv evaluated on the 



Table 1. Comparison of the LMH code's AUC performance on different 
methods 
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85.6 zb 0.1 


72.3 ±0.4 


85.2 ±0.1 




BNN 


91.1 ±0.3 


75.5 ± 0.6 


90.4 ± 0.3 




DNN 


90.7 ± 0.6 


76.6 ± 0.7 


89.7 ± 0.7 


(b) AUC L mh 


TV 








Tissue 


Method 


Low 


Medium 


High 


Brain 


MLR 


71.1 ±0.2 


58.8 ±0.2 


70.8 ±0.1 




BNN 


77.9 ±0.5 


61.1 ±0.5 


76.5 ±0.7 




DNN 


82.8 ±1.0 


69.5 ±1.1 


81.1 ±0.4 


Heart 


MLR 


73.9 ±0.3 


58.6 ±0.4 


72.7 ±0.1 




BNN 


78.1 ±0.3 


58.9 ±0.3 


75.7 ±0.3 




DNN 


82.0 ±1.1 


67.4 ± 1.3 


79.7 ± 1.2 


Kidney 


MLR 


79.7 ±0.3 


64.3 ±0.2 


79.4 ±0.2 




BNN 


83.9±0.5 


66.4 ±0.5 


83.3 ±0.6 




DNN 


86.2 ±0.6 


73.2 ±1.3 


85.3 ±1.2 


Liver 


MLR 


80.1 ±0.5 


63.7 ±0.3 


79.4 ±0.3 




BNN 


84.9 ±0.7 


65.4 ±0.7 


84.4 ±0.7 




DNN 


87.7 ±0.6 


69.4 ±1.2 


84.8 ±0.8 


Testis 


MLR 


77.3 ±0.2 


60.8 ±0.3 


77.0 ±0.1 




BNN 


81.1 ±0.5 


63.9 ±0.9 


81.0±0.5 




DNN 


84.6 ±1.1 


67.8 ±0.9 


83.5 ±0.9 



Notes: ± indicates 1 standard deviation; top performances are shown in bold. 



subset of events that exhibit large tissue variability. Here, the 
DNN significantly outperforms the BNN in all categories and 
tissues. The improvement in tissue specificity is evident from the 
large gains in the medium category, where exons are more likely 
to have large tissue variability. In both comparisons, the MLR 
performed poorly compared with both the BNN and DNN. 

Next, we look at how well the different methods can predict 
APSI between two tissues, where it must determine the direction 
of change. This is shown in Table 2. As described above, APSI 
predictions for the BNN were made by training a MLR classifier 
on the LMH outputs (BNN-MLR). To make the comparison 
fair, we included the performance of the DNN in making APSI 
predictions by also using a MLR classifier (DNN-MLR) on the 
LMH outputs. Finally, we evaluated the APSI predictions dir- 
ectly from the DNI code, as well as the MLR baseline method, 
where the inputs include the tissue types. 
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Table 2. Comparison of the DNI code's performance in terms of the AUC for decrease versus increase (AUC DvI ) and change versus no change 

(AUCchange) 



(a) AUC DvI (b) AUCcham 



Method Brain Brain Brain Brain Heart Heart Heart Kidney Kidney Liver Change 
versus versus versus versus versus versus versus versus versus versus versus 
Heart Kidney Liver Testis Kidney Liver Testis Liver Testis Testis No change 



MLR 50.3±0.2 48.8±0.8 48.3±1.1 51.2±0.5 50.0 ±1.5 47.8 ±1.7 51.1 ±0.5 49.4±0.8 51.9±0.5 51.3±0.6 74.7±0.1 

BNN-MLR 65.3 ±0.3 73.7 ±0.2 69.1 ±0.4 72.9 ±0.5 72.6 ±0.3 66.7 ±0.4 68.3 ±0.7 54.7 ±0.6 65.0 ±0.8 65.0 ±0.9 76.6 ±0.8 

DNN-MLR 77.9±0.1 83.0±0.1 81.6±0.1 82.3±0.2 82.4±0.1 81.3±0.1 82.4±0.1 76.8±0.5 79.9±0.2 79.1 ±0.1 79.9±0.8 

DNN 79.4 ±0.7 83.3 ±0.8 82.5 ±0.6 82.9 ±0.7 86.1 ±1.0 85.1 ±1.1 84.8 ±0.8 76.2 ±1.0 82.5 ±1.0 81.8 ±1.3 86.5 ±1.0 



Note: ± indicates 1 standard deviation; top performances are shown in bold. 

Table 2a shows the AUC DvI for classifying decrease versus 
increase inclusion for all pairs of tissue. Both the DNN-MLR 
and DNN outperform the BNN-MLR by a good margin. 
Comparing the DNN with DNN-MLR, the DNN shows some 
gain in differentiating brain and heart AS patterns from other 
tissues. The performance of differentiating the remaining tissues 
(kidney, liver and testis) with each other is similar between the 
DNN and DNN-MLR. We note that the similarity between the 
DNN and DNN-MLR in terms of performance can be due to 
the use of soft labels for training. Using MLR directly on the 
genomic features and tissue types performs rather poorly, where 
predictions are no better than random. 

The models are further evaluated on predicting whether there 
is a difference in splicing patterns for all tissues, without specify- 
ing the direction. AUC C han g e is computed on all exons and tissue 
pairs. This is shown in Table 2b. The results indicate that this is a 
less demanding task, as the models can potentially use just the 
genomic features to determine whether an exon will have tissue 
variability. The difference in performance between all methods is 
less compared with AUC DvI . However, as the evaluation is over 
all pairs of tissues, the DNN, which has access to the tissue types 
in the input, does significantly better. Although this is also true 
for the MLR, it still performed worst overall. This suggests that 
in the proposed architecture where tissue types are given as an 
input, the MLR lacks the capacity to learn a representation that 
can jointly use tissue types and genomic features to make pre- 
dictions that are tissue- specific. Both results from Table 2 show 
that there is an advantage to learning a DNI code rather than 
just learning the LMH code. 

To test whether the predictions generalize to RNA-Seq data 
from a different experiment, we selected data for two mouse tis- 
sues, namely the brain and the heart, from (Barbosa-Morais et al, 
2012), and analyzed how our model, which is trained with data 
from (Brawand et al, 2011), performs. Table 3 shows the set of 
evaluations on the DNN identical to that of Tables 1 and 2, tested 
on this RNA-Seq data. For the brain, there is an ~l^t% decrease 
in AUC LM h_aii and ~4— 5% decrease for AUC LM h_tv- For the 
heart, the model's performance on both dataset is equivalent to 
within 1 standard deviation for both AUC LM h_ah and 
AUC LM h_tv- A decrease in performance of ~7% is observed in 
AUC DvI for brain versus heart. There is an increase in AUC C han ge 
but that is owing to only two tissues being evaluated as opposed to 



Table 3. Performance of the DNN evaluated on a different RNA-Seq 
experiment 



(a) AUC L mh 


_A11 








Tissue 




Low 


Medium 


High 


Brain 
Heart 




88.1 ±0.5 
90.7 ±0.5 


76.1 ±1.0 
78.4±1.3 


87.0 ±0.6 
89.0 ±1.0 


(b) AUC L mh 


TV 








Tissue 




Low 


Medium 


High 


Brain 
Heart 




79.1 ±0.9 
82.6 ±1.0 


66.1 ±1.0 
65.3 ±1.2 


77.6 ±0.8 
78.8±1.1 


(c) AUClmh 


TV 




(d) AUCchange 




Method 




Brain versus 
Heart 


Method 


Change versus 
No Change 


DNN-MLR 
DNN 




72.9 ±0.1 
74.2 ±1.5 


DNN-MLR 
DNN 


81.7±1.0 
91.9±0.7 



five, where the AUC would be pulled down by the other tissues 
with lower performances if they were present. 

Overall, the decrease in performance is not unexpected, owing 
to differences in PSI estimates from variations in the experimen- 
tal setup. To see how PSI differed, we computed the expected PSI 
values for brain and heart in all exons from both sets of experi- 
ments, and evaluated their Pearson correlation. For the brain, 
the correlation is 0.945, and for the heart, it is 0.974. This can 
explain why there is a larger decrease in performance for brain, 
which is a particularly heterogeneous tissue, and hence can vary 
more between experiments depending on how the samples were 
prepared. We note that the performance of the DNN on this 
dataset is still better than the BNN's predictions on the original 
dataset. Viewed as a whole, the results indicate that our model 
can indeed be useful for splicing pattern predictions for PSI 
estimates computed from other datasets. It also shows that our 
RNA-Seq processing pipeline is consistent. 
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We believe there are several reasons why the proposed 
DNN has improved predictive performance in terms of tissue- 
specificity compared with the previous BNN splicing code. One 
of the main novelties is the use of tissue types as an input feature, 
which stringently required the model's hidden representations be 
in a form that can be well-modulated by information specifying 
the different tissue types for splicing pattern prediction. This re- 
quirement would be relaxed if each tissue was trained separately. 
Furthermore, this hidden representation is described by thou- 
sands of hidden units and multiple layers of non-linearity. In con- 
trast, the BNN only has 30 hidden units to represent the variables 
that can be used by the model to modulate splicing based on the 
cellular condition, which may not be sufficient. Another difference 
is that for the DNI code, a training objective was specifically de- 
signed to learn to predict APSI, which is absent from the BNN. 
However, the performance gain of the DNN-MLR over BNN- 
MLR shows that this is only part of the improvement. 

In addition, we performed hyperparameter search to optimize 
the DNN, where we gained considerable improvements over our 
original hand- tuned models, at ~4.5% for the DNI code and 
~3.5% for the LMH code. Interestingly, the final set of hyper- 
parameters found opt for a much larger (~4x) number of hidden 
units than our initial attempt (with matching hyperparameters). 
Manually trying to find these hyperparameters would have been 
difficult, where a user may settle for a suboptimal set of hyper- 
parameters owing to the substantial effort and time required for 
training a model with millions of parameters. 

Another performance boost came from the use of dropout, 
which contributed ~l-6% improvement in the LMH code for 
different tissues, and ~2-7% in the DNI code, compared with 
without. The performance difference would likely be larger if 
hyperparameter optimization were not performed on the model 
that did not use dropout. We note also that even with dropout, a 
small LI weight penalty was found to be beneficial, which may 
explain our model's tolerance for a large number of hidden units 
with sparse weights. 

One additional difference compared with previous work is that 
training was biased toward the tissue- specific events (by con- 
struction of the minibatches), thereby promoting the model to 
learn a good representation about cellular context. We were able 
to get some small performance gains (within 2 standard devi- 
ations) of ~ 1-2% in AUClmhtv an d AUC DV i using this meth- 
odology compared with training with all events treated equally. 
More importantly, biasing the training examples encourages the 
model to learn about the tissues as input, which has a signifi- 
cantly different meaning compared with the genomic features 
and make up only a small number of the input dimension. We 
find that without this learning bias, the model more frequently 
settles to a bad local minimum, or does not learn to use the 
tissues as input at all. Together, all these changes allowed us to 
train a model that significantly improves on previous work. 

With regards to training the two tasks jointly, we found that 
with hyperparameter tuning, the performance of the model when 
each task was trained separately compared with being trained 
together was not statistically different. This is likely because 
both tasks are too similar for any transfer learning to take 
place, as evident by the similarity in performance in the DNI 
code between the DNN and DNN-MLR models. Nevertheless, 
we find that training both codes together stabilizes learning, 



specifically, training becomes more tolerant to a larger range of 
hyperparameters leading to reduced variance between models. 

3.2 Model and feature analysis 

A major contribution to the success of splicing pattern predic- 
tions that generalize well comes from the richness of our feature 
set. For example, we observed a significant decrease in the per- 
formance of the splicing code if the reduced feature vector di- 
mension is too small by either principle component analysis or an 
autoencoder with small number of hidden units. We found that 
the performance of both the LMH code and the DNI code drops 
by up to 4% when the reduced dimension is at 150 (down from 
1393). This suggests a sufficiently large number of hidden vari- 
ables denoting genomic features are required to interact with 
tissue inputs to achieve good performance. 

It can be useful to see how the genomic features are used by 
the DNN to perform splicing pattern predictions. We analyzed 
our model in two different ways. 

In the first method, to see which feature types are important to 
the model, we substituted genomic features to their median across 
all exons and looked at how the predictive performance changed. 
We divided the full feature set into 55 groups based on what they 
represent. The grouping, along with additional descriptions, can 
be found in Section 4 of the Supplementary Material. Here, the 
performance measure is defined as the sum of the three classes 
from AUC LM h_aii- The decrease in test performance (as a fraction 
of that obtained with the full feature set) when each group of 
features is substituted by their median is shown in Figure 2. 
Feature groups that cause large decrease in performance are pre- 
sumably important to the splicing code. The standard deviation is 
computed from the five trained models with random partitions of 
the data as described above. The order of the feature group to- 
ward the right of the plot should not be used to determine their 
order of importance owing to the small difference they make to the 
model relative to their standard deviations. It is interesting to see 
how small the decrease in AUC is when each feature group is 
effectively removed. Many features contain redundant informa- 
tion and therefore can compensate for missing features from other 
groups. For example, some of the motifs for splicing factors are 
represented in features representing /i-mer counts. The most influ- 
ential features describe the translatability of the exon, conserva- 
tion scores and whether the alternative exon introduces a frame 
shift. The feature groups corresponding to counts of 3-mers and 5- 
mers are also important. 

To examine how each individual feature affects the DNN's pre- 
dictions, we adapted the method from (Simony an et al, 2014). 
Briefly, examples from the dataset are given as input to the trained 
model and forward propagated through the neural network. At 
the output, the target is modified to a different value, for example, 
in classification, by changing the class label. The error signal is 
then backpropagated to the inputs. The resulting signal describes 
how much each input feature needs to change to make the mod- 
ified prediction, as well as the direction. The computation is ex- 
tremely quick, as it only requires a single forward and backward 
pass through the DNN, and all examples can be calculated in 
parallel. We used this procedure on exons with low tissue variabil- 
ity, and modified the low PSI targets to high, and the high PSI 
targets to low. Table 4 lists the top 25 features with the largest 
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Fig. 2. Plot of the change in AUC LMH A11 by substituting the values in each feature groups by their median. Feature groups that are more important to 
the predictive performance of the model have lower values. The groups are sorted by the mean over multiple partitions and folds, with the standard 
deviations shown. The number of features for each feature group are indicated in brackets 



backpropagated signal magnitude (which indicate that these fea- 
tures need to change the least to affect the prediction the most, and 
are hence important; note also that all of our features are normal- 
ized). The table also indicates general trends in the direction of 
change for each feature over the dataset. If more than 5% of the 
examples do not follow the general direction of change, it is indi- 
cated by both an up and down arrow. Some of the splicing rules 
inferred by the model can be seen. For example, presence of spli- 
cing silencers inhibits the splicing of the alternative exon leading to 
higher inclusion, a shorter alternative exon is more likely to be 
spliced out, and the strength and position of acceptor and donor 
sites can lead to different splicing patterns. 

Next, we wanted to see how features are used in a tissue- specific 
manner. Using the set of exons with high tissue variability, we 
computed the backpropagation signal to the inputs with the 
output targets changed in the same manner as above, for each 
tissue separately. Figure 3 shows the sum of the magnitudes of 
the gradient, normalized by the number of examples in each tissue 
for the top 50 features. We can observe that the sensitivity of each 
feature to the model's predictions differs between tissues. The 
profile for kidney and liver tend to be more similar with each 
other than others, which associates well with the model's weaker 
performance in differentiating these two tissues. This figure also 
provides a view of how genomic features are differentially used by 
the DNN, modulated by the input tissue types. In both Table 4 
and Figure 3, the backpropagation signals were computed on ex- 
amples from the test set, for all five partitions and folds. 

4 CONCLUSIONS 

In this work, we introduced a computational model that extends 
the previous splicing code with new prediction targets and im- 
proved tissue-specificity, using a learning algorithm that scales 
well with the volume of data and the number of hidden variables. 
The approach is based on DNNs, which can be trained rapidly 
with the aid of GPUs, thereby allowing the models to have a 



Table 4. The top 25 features (unordered) of the splicing code that de- 
scribes low and high percent inclusion 



Feature description 


Low 


High 


Strength of the 11 acceptor site 




t 


Strength of the 12 donor site 




t 


Strength of the 11 donor site 


t 




Mean conservation score of first 100 bases in 3' end of 11 


H 


u 


Mean conservation score of first 100 bases in 5' end of 12 


H 


u 


Counts of Burge's exonic splicing silencer in A 




t 


Counts of Chasin's exonic splicing silencer in A 




t 


Log base 10 length of exon A 




t 


Log base 10 length ratio between A and 12 




t 


Whether exon A introduces frame shift 


H 


u 


Predicted nucleosome positioning in 3' end of A 


H 


H 


Frequency of AGG in exon A 


t 




Frequency of CAA in exon A 




t 


Frequency of CGA in exon A 




t 


Frequency of TAG in exon A 


t 




Frequency of TCG in exon A 




t 


Frequency of TTA in exon A 


t 




Translatability of CI -A 




t 


Translatability of C1-A-C2 




t 


Translatability of C1-C2 


t 




Counts of Yeo's 'GTAAC motif cluster in 5' end of 12 




t 


Counts of Yeo's 'TGAGT' motif cluster in 5' end of 12 




t 


Counts of Yeo's 'GTAGG motif cluster in 5' end of 12 




t 


Counts of Yeo's 'GTGAG motif cluster in 5' end of 12 




t 


Counts of Yeo's 'GTAAG motif cluster in 5 ; end of 12 


1 


t 



Note: The direction of the arrows indicate that a feature's value should in general be 
increased (f) or decreased (l) to change the PSI predictions to low or high. Feature 
details can be found in Section 4 of the Supplementary Material. 



large set of parameters and deal with complex relationships pre- 
sent in the data. We demonstrate that deep architectures can be 
beneficial even with a sparse biological dataset. We further 
described how the input features can be analyzed in terms of 
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Brain 



Kidney 



I 




Fig. 3. Magnitude of the backpropagated signal to the input of the top 50 features computed when the targets are changed from low to high, and high to 
low. White indicates that the magnitude of the signal is large, meaning that small perturbations to this input can cause large changes to the model's 
predictions. The features are approximately sorted left to right by the magnitude 



the predictions of the model to gain some insights into the 
inferred tissue-regulated splicing code. 

Our architecture can easily be extended to the case of more 
data from different sources. For example, using the same archi- 
tecture, we may be able to learn a hidden representation 
that spans additional tissue types as well as multiple species. 
Through transfer learning, training such model with multiple 
related targets might be beneficial particularly if the number of 
training examples in certain species or tissues is small. 
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